Systematic Evaluation of Neural Retrieval Models on the Touché 2020 Argument Retrieval Subset of BEIR

Thakur, Nandan; Bonifacio, Luiz; Fröbe, Maik; Bondarenko, Alexander; Kamalloo, Ehsan; Potthast, Martin; Hagen, Matthias; Lin, Jimmy

doi:10.1145/3626772.3657861


arXiv:2407.07790 (cs)

[Submitted on 10 Jul 2024]


Abstract: The zero-shot effectiveness of neural retrieval models is often evaluated on the BEIR benchmark, a combination of different IR evaluation datasets. Interestingly, previous studies found that particularly on the BEIR subset Touché 2020, an argument retrieval task, neural retrieval models are considerably less effective than BM25. Still, so far, no further investigation has been conducted on what makes argument retrieval so "special". To analyze the potential limits of neural retrieval models more deeply, we run a reproducibility study on the Touché 2020 data. In our study, we focus on two experiments: (i) a black-box evaluation (i.e., no model retraining), incorporating a theoretical exploration using retrieval axioms, and (ii) a data denoising evaluation involving post-hoc relevance judgments. Our black-box evaluation reveals an inherent bias of neural models towards retrieving short passages from the Touché 2020 data, and we also find that quite a few of the neural models' results are unjudged in the Touché 2020 data. As many of the short Touché passages are not argumentative and thus non-relevant per se, and as the missing judgments complicate fair comparison, we denoise the Touché 2020 data by excluding very short passages (fewer than 20 words) and by augmenting the unjudged data with post-hoc judgments following the Touché guidelines. On the denoised data, the effectiveness of the neural models improves by up to 0.52 in nDCG@10, but BM25 is still more effective. Our code and the augmented Touché 2020 dataset are publicly available.
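The two mechanics the abstract names, dropping passages shorter than 20 words and scoring rankings with nDCG@10, can be illustrated with a minimal Python sketch. This is not the authors' released code. The function names, the dictionary layouts for the corpus and the relevance judgments, the whitespace-based word count, and the choice to score unjudged documents as non-relevant are all assumptions made for illustration.

```python
import math

MIN_WORDS = 20  # threshold from the abstract; counting words by whitespace is an assumption


def drop_short_passages(corpus, min_words=MIN_WORDS):
    """Keep only passages with at least `min_words` whitespace-separated tokens.

    `corpus` is a hypothetical mapping of document id to passage text.
    """
    return {
        doc_id: text
        for doc_id, text in corpus.items()
        if len(text.split()) >= min_words
    }


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k with linear gain and a log2 rank discount.

    `qrels` is a hypothetical mapping of document id to graded relevance for
    one query. Unjudged documents are scored as gain 0 (an assumption here).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Scoring unjudged documents as gain 0 shows why missing judgments matter for a fair comparison: a system that retrieves many unjudged passages is penalized as if those passages were non-relevant, whether or not they actually are.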

Comments:	SIGIR 2024 (Resource & Reproducibility Track)
Subjects:	Information Retrieval (cs.IR)
Cite as:	arXiv:2407.07790 [cs.IR]
	(or arXiv:2407.07790v1 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2407.07790
Related DOI:	https://doi.org/10.1145/3626772.3657861

Submission history

From: Nandan Thakur [view email]
[v1] Wed, 10 Jul 2024 16:07:51 UTC (164 KB)
